Object localization using machine learning

ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining a location of a particular object relative to a vehicle. In one aspect, a method includes obtaining sensor data captured by one or more sensors of a vehicle. The sensor data is processed by a convolutional neural network to generate a sensor feature representation of the sensor data. Data is obtained which defines a particular spatial region in the sensor data that has been classified as including sensor data that characterizes the particular object. An object feature representation of the particular object is generated from a portion of the sensor feature representation corresponding to the particular spatial region. The object feature representation of the particular object is processed using a localization neural network to generate the location of the particular object relative to the vehicle.

CROSS REFERENCE TO RELATED APPLICATION

This patent application is a continuation (and claims the benefit of priority under 35 USC 120) of U.S. patent application Ser. No. 16/151,880, filed Oct. 4, 2018. The disclosure of the prior application is considered part of (and is incorporated by reference in) the disclosure of this application.

BACKGROUND

This specification relates to autonomous vehicles.

Autonomous vehicles include self-driving cars, boats, and aircraft. Autonomous vehicles use a variety of on-board sensors and computer systems to detect nearby objects and use such detections to make control and navigation decisions.

SUMMARY

This specification describes a system implemented as computer programs on one or more computers in one or more locations that performs object localization.

According to a first aspect there is provided a method which includes obtaining sensor data captured by one or more sensors of a vehicle. The sensor data is processed using a convolutional neural network to generate a sensor feature representation of the sensor data. Data is obtained which defines a particular spatial region in the sensor data that has been classified as including sensor data that characterizes a particular object in an environment in a vicinity of the vehicle. An object feature representation of the particular object is generated from a portion of the sensor feature representation corresponding to the particular spatial region. The object feature representation of the particular object is processed using a localization neural network to generate an output characterizing a location of the particular object in the environment relative to the vehicle.

In some implementations, sensor data captured by one or more of: a laser sensor of the vehicle, a radar sensor of the vehicle, and a camera sensor of the vehicle, is aligned and combined.

In some implementations, the particular spatial region in the sensor data that has been classified as including sensor data that characterizes a particular object in the environment in the vicinity of the vehicle is defined by a bounding box with a rectangular geometry.

In some implementations, the particular spatial region in the sensor data that has been classified as including sensor data that characterizes a particular object in an environment in a vicinity of the vehicle is generated by processing at least part of the sensor data using an object detection neural network.

In some implementations, the object feature representation of the particular object is generated from a portion of the sensor data corresponding to the particular spatial region in addition to the portion of the sensor feature representation corresponding to the particular spatial region.

In some implementations, to generate the object feature representation of the particular object, the portion of the sensor feature representation corresponding to the particular spatial region is cropped and transformed using one or more pooling operations.

In some implementations, the localization neural network is configured to generate coordinates characterizing a position of a center of the particular object in the environment. The coordinates may be expressed in a coordinate system which is defined relative to the vehicle.

In some implementations, the localization neural network is configured to generate a distance value characterizing a distance of the particular object in the environment from the vehicle.

According to a second aspect there is provided a system, including a data processing apparatus and a memory in data communication with the data processing apparatus and storing instructions that cause the data processing apparatus to perform the operations of the previously described method.

According to a third aspect there is provided one or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the previously described method.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.

The object localization system described in this specification can accurately determine the location of an object relative to a vehicle based on spatial data maps representing sensor data captured by the sensors of the vehicle. Therefore, the object localization system can be deployed in a vehicle on-board system to enable the on-board system to make fully-autonomous or partly-autonomous driving decisions, present information to the driver of the vehicle to assist the driver in operating the vehicle safely, or both.

The object localization system described in this specification can be trained to directly predict the locations of objects relative to the vehicle from spatial data maps representing sensor data captured by the sensors of the vehicle. In particular, the object localization system described in this specification can learn to implicitly recognize object-specific features (e.g., size and shape) and contextual features (e.g., position relative to the road) of objects from spatial data maps without being explicitly programmed to do so. In contrast, some conventional systems perform object localization by processing large numbers of hand-crafted features (e.g., the possible heights of traffic lights, the shapes of traffic cones, and the like) which must be explicitly specified as inputs to the conventional system. The object localization system described in this specification may achieve higher localization accuracy than these conventional systems and does not necessitate laborious hand-crafted feature engineering.

The object localization system described in this specification can be co-trained to generate auxiliary outputs in addition to object localization outputs (e.g., auxiliary outputs which characterize the geometry of the environment in the vicinity of the vehicle). Co-training the object localization system to generate auxiliary outputs can (in some cases) enable the object localization system described in this specification to reach an acceptable performance level (e.g., object localization accuracy) over fewer training iterations than some conventional object localization systems. Therefore, training the object localization system described in this specification may consume fewer computational resources (e.g., memory, computing power, or both) than training some conventional object localization systems.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an example on-board system.

FIG. 2 is a block diagram of an example object localization system.

FIG. 3 is a block diagram of an example training system.

FIG. 4 is a flow diagram of an example process for generating object localization data characterizing the location of a particular object relative to a vehicle.

FIG. 5 is a flow diagram of an example process for updating the parameter values of a training object localization system.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

This specification describes how a vehicle can use an object localization system to determine the locations, relative to the vehicle, of objects in the environment in the vicinity of the vehicle. To determine the location of a particular object relative to the vehicle, the object localization system processes (e.g., using a convolutional neural network) spatial data maps representing sensor data captured by sensors of the vehicle to generate a feature representation of the sensor data. The object localization system generates a feature representation of the particular object from a portion of the feature representation of the sensor data that characterizes the particular object. Subsequently, the object localization system processes the feature representation of the particular object using a localization neural network to determine the location of the particular object relative to the vehicle. The feature representation of the particular object implicitly characterizes various properties of the particular object which may be useful for determining the location of the particular object relative to the vehicle. For example, the feature representation of the particular object may characterize the size and shape of the particular object, and laser and radar sensor measurements corresponding to the particular object. As another example, the feature representation of the particular object may characterize the position of the particular object relative to the road and the relationships between the particular object and other nearby objects.

The vehicle can use the output of the object localization system to perform actions which cause the vehicle to operate more safely. For example, in response to determining that the location of an object (e.g., a pedestrian) is in the future trajectory of the vehicle, a planning system of the vehicle can automatically apply the brakes of the vehicle or otherwise automatically change the future trajectory of the vehicle to prevent a collision between the object and the vehicle. As another example, in response to determining that the location of an object is in the future trajectory of the vehicle, a user interface system can present an alert message to the driver of the vehicle with instructions to adjust the future trajectory of the vehicle or apply vehicle brakes prior to collision.

These features and other features are described in more detail below.

FIG. 1 is a block diagram of an example on-board system 100. The on-board system 100 is composed of hardware and software components, some or all of which are physically located on-board a vehicle 102. In some cases, the on-board system 100 can make fully-autonomous or partly-autonomous driving decisions (i.e., driving decisions taken independently of the driver of the vehicle 102), present information to the driver of the vehicle 102 to assist the driver in operating the vehicle safely, or both. For example, in response to determining that the vehicle 102 is likely to collide with an object, the on-board system 100 may autonomously apply the brakes of the vehicle 102 or otherwise autonomously change the trajectory of the vehicle 102 to prevent a collision between the vehicle 102 and the object. As another example, in response to determining that the vehicle 102 is likely to collide with an object, the on-board system 100 may present an alert message to the driver of the vehicle 102 with instructions to adjust the trajectory of the vehicle 102 to avoid a collision.

Although the vehicle 102 in FIG. 1 is depicted as an automobile, and the examples in this document are described with reference to automobiles, in general the vehicle 102 can be any kind of vehicle. For example, besides an automobile, the vehicle 102 can be a watercraft or an aircraft. Moreover, the on-board system 100 can include components additional to those depicted in FIG. 1 (e.g., a collision detection system or a navigation system).

The on-board system 100 includes a sensor system 104 which enables the on-board system 100 to “see” the environment in the vicinity of the vehicle 102. More specifically, the sensor system 104 includes one or more sensors, some of which are configured to receive reflections of electromagnetic radiation from the environment in the vicinity of the vehicle 102. For example, the sensor system 104 can include one or more laser sensors (e.g., LIDAR laser sensors) that are configured to detect reflections of laser light. As another example, the sensor system 104 can include one or more radar sensors that are configured to detect reflections of radio waves. As another example, the sensor system 104 can include one or more camera sensors that are configured to detect reflections of visible light.

In some implementations, the sensor system 104 includes a combination of short-range and long-range laser sensors. The short-range laser sensors can be used to detect the ground surrounding the vehicle 102 and objects near the vehicle 102 (e.g., objects within 40 meters of the vehicle 102). The long-range laser sensors can be used to detect objects which are farther away from the vehicle 102 (e.g., objects up to 80 meters away from the vehicle 102).

The sensor system 104 continuously (i.e., at each of multiple time points) captures raw sensor data which can indicate the directions, intensities, and distances travelled by reflected radiation. For example, a sensor in the sensor system 104 can transmit one or more pulses of electromagnetic radiation in a particular direction and can measure the intensity of any reflections as well as the time that the reflection was received. A distance can be computed by determining the time which elapses between transmitting a pulse and receiving its reflection. Each sensor can continually sweep a particular space in angle, azimuth, or both. Sweeping in azimuth, for example, can allow a sensor to detect multiple objects along the same line of sight.

The sensor system 104 is configured to process the raw sensor data captured at a time point to generate one or more spatial data maps 106 that represent the raw sensor data captured at the time point. Each spatial data map 106 can be represented as a matrix (e.g., a two-, three-, or four-dimensional matrix) of numerical values. For example, the sensor system 104 may generate spatial data maps 106 representing the raw sensor data captured by the laser sensors, the radar sensors, or both, which characterize the distance from the vehicle 102 to each of multiple points in the vicinity of the vehicle. As another example, the sensor system 104 may generate spatial data maps 106 representing the raw sensor data captured by the camera sensors which characterize the visual appearance of the environment in the vicinity of the vehicle by one or more photographic images (e.g., red-green-blue (RGB) images).

The on-board system 100 includes an object localization system 108 which is configured to process the spatial data maps 106 to generate object localization data 110 characterizing the locations, relative to the vehicle 102, of different objects in the vicinity of the vehicle 102. For example, as will be described in more detail with reference to FIG. 2, the object localization data 110 may characterize the location of an object in the vicinity of the vehicle 102 by a set of numerical coordinates indicating the location of the object in a coordinate system. The example object localization data 112 depicted in FIG. 1 characterizes the location of object A by the Cartesian coordinates [12, 7, 0.5] and the location of object B by the Cartesian coordinates [14, 2, 2]. As another example, the object localization data 110 may characterize the location of an object in the vicinity of the vehicle 102 by a numerical value indicating the distance (e.g., measured in feet) of the object from the vehicle 102. Objects in the vicinity of the vehicle 102 may include, for example, pedestrians, bicyclists, animals, other vehicles, road signs, pylons, and traffic lights.

The on-board system 100 can provide the object localization data 110 generated by the object localization system 108 to a planning system 114, a user interface system 116, or both.

When the planning system 114 receives the object localization data 110, the planning system 114 can use the object localization data 110 to make fully-autonomous or partly-autonomous planning decisions which plan the future trajectory of the vehicle. The planning decisions generated by the planning system can, for example, include: yielding (e.g., to other vehicles), stopping (e.g., at a Stop sign), passing other vehicles, adjusting vehicle lane position to accommodate a bicyclist, slowing down in a school or construction zone, merging (e.g., onto a highway), and parking. For example, the planning system 114 can autonomously generate planning decisions to navigate the vehicle 102 to avoid a collision with an object by changing the future trajectory of the vehicle 102 to avoid the object. In a particular example, the planning system 114 can generate planning decisions to apply the brakes of the vehicle 102 to avoid a collision with a pedestrian in the current trajectory of the vehicle 102. As another example, in response to receiving object localization data 110 indicating that a “Stop” sign is adjacent to the future trajectory of the vehicle 102, the planning system can generate planning decisions to apply the brakes of the vehicle 102 before the Stop sign.

The planning decisions generated by the planning system 114 can be provided to a control system of the vehicle. The control system of the vehicle can control some or all of the operations of the vehicle by implementing the planning decisions generated by the planning system. For example, in response to receiving a planning decision to apply the brakes of the vehicle, the control system of the vehicle may transmit an electronic signal to a braking control unit of the vehicle. In response to receiving the electronic signal, the braking control unit can mechanically apply the brakes of the vehicle.

When the user interface system 116 receives the object localization data 110, the user interface system 116 can use the object localization data 110 to present information to the driver of the vehicle 102 to assist the driver in operating the vehicle 102 safely. The user interface system 116 can present information to the driver of the vehicle 102 by any appropriate means, for example, by an audio message transmitted through a speaker system of the vehicle 102 or by alerts displayed on a visual display system in the vehicle (e.g., an LCD display on the dashboard of the vehicle 102). In a particular example, the user interface system 116 can present an alert message to the driver of the vehicle 102 with instructions to adjust the trajectory of the vehicle 102 to avoid a collision with a pedestrian in the current trajectory of the vehicle 102.

FIG. 2 is a block diagram of an example object localization system 200. The object localization system 200 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The object localization system 200 is configured to process spatial data maps 202 representing sensor data captured by sensors on a vehicle to generate object localization data 204 characterizing the locations, relative to the vehicle, of one or more objects in the vicinity of the vehicle. As described with reference to FIG. 1, each spatial data map 202 can represent raw sensor data captured by laser sensors, radar sensors, camera sensors, or other vehicle sensors, and may be represented as a matrix (e.g., a two- or three-dimensional matrix) of numerical values. In some cases, each spatial data map can be represented as a matrix where two of the dimensions of the matrix are “spatial dimensions”. A spatial location in a spatial data map refers to a position in the spatial data map indexed by values of the spatial dimension coordinates. Each spatial location in a spatial data map characterizes a respective spatial region of the environment. In some cases, some or all of the spatial data maps 202 processed by the object localization system 200 have the same perspective of the environment and are aligned with one another. That is, corresponding spatial locations in the different spatial data maps characterize the same spatial region in the environment.

In some implementations, the object localization system 200 may be configured to process a roadmap spatial data map which defines known positions of static objects (e.g., traffic lights, crosswalks, and the like) in the environment in the vicinity of the vehicle. In some implementations, the object localization system 200 may be configured to process an elevation spatial data map which defines known elevations (e.g., in feet above sea level) of positions in the environment in the vicinity of the vehicle. In some implementations, the object localization system 200 may be configured to process sensor data from additional vehicle sensors such as heat sensors and audio sensors.

In some implementations, the system 200 generates object localization data 204 which characterizes the location of an object in the vicinity of the vehicle by a numerical value indicating the distance (e.g., measured in feet) of the object from the vehicle. In some implementations, the system 200 generates object localization data 204 which characterizes the location of an object in the vicinity of the vehicle by a set of numerical coordinates indicating the location of the object in a coordinate system that is defined relative to the vehicle. For example, the coordinate system may be a Cartesian coordinate system (e.g., centered on the vehicle), and the object localization data 204 may characterize the location of the object by an (x, y, z) coordinate of the center of the object in the Cartesian coordinate system. As another example, the coordinate system may be a spherical coordinate system (e.g., centered on the vehicle), and the object localization data 204 may characterize the location of the object by an (r, θ, ϕ) coordinate of the center of the object in the spherical coordinate system.
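The relationship between these two vehicle-centered representations is a standard change of coordinates. The following Python sketch is illustrative only; the function names and the particular angle conventions are assumptions, not part of the specification.

```python
import math

def spherical_to_cartesian(r, theta, phi):
    """Convert an (r, theta, phi) object location in a vehicle-centered
    spherical coordinate system to (x, y, z) Cartesian coordinates.

    Here theta is treated as the azimuth angle in the ground plane and phi
    as the polar angle measured from vertical; conventions vary.
    """
    x = r * math.sin(phi) * math.cos(theta)
    y = r * math.sin(phi) * math.sin(theta)
    z = r * math.cos(phi)
    return x, y, z

def distance_from_vehicle(x, y, z):
    """Distance of the object from the vehicle (the coordinate origin)."""
    return math.sqrt(x * x + y * y + z * z)

# Example: an object roughly 12 units ahead, slightly to the left of and
# above the sensor origin.
x, y, z = spherical_to_cartesian(r=12.0, theta=0.1, phi=math.pi / 2 - 0.05)
print(x, y, z, distance_from_vehicle(x, y, z))
```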

In the example depicted by 206, the object localization data 204 characterizes the location of a “Stop” sign in the vicinity of the vehicle by the coordinates of the center of the “Stop” sign in a Cartesian coordinate system centered on the vehicle.

For convenience, the description which follows refers to the system 200 generating object localization data 204 which characterizes the location, relative to the vehicle, of a particular object in the vicinity of the vehicle. More generally, the system 200 can generate object localization data 204 which characterizes the location, relative to the vehicle, of any number of objects in the vicinity of the vehicle.

To generate object localization data 204 which characterizes the location of a particular object relative to the vehicle, the system 200 first obtains data defining the position of the particular object in the spatial data maps 202. More specifically, the system 200 obtains data which defines a particular spatial region in the spatial data maps 202 that has been classified as including sensor data that characterizes the particular object. A spatial region in a spatial data map refers to a set of spatial locations in the spatial data map. For example, the data defining the position of the particular object in the spatial data maps 202 may be an object bounding box 208 which is defined by the coordinates of a bounding box delineating a particular spatial region in the spatial data maps 202 that represents the particular object. In the description which follows, while the data defining the position of the particular object in the spatial data maps 202 may be referred to as an object bounding box 208, it should be understood that the data defining the position of the particular object in the spatial data maps 202 can have any other appropriate format. For example, the data defining the position of the particular object in the spatial data maps 202 may define a non-rectangular (e.g., circular) spatial region in the spatial data maps 202. In some implementations, the data defining the position of the particular object in the spatial data maps 202 may be an output generated by an object detection neural network by processing an input including the spatial data maps 202.

While the object bounding box 208 defines the position of the particular object in the spatial data maps 202, to generate the object localization data 204, the system 200 must determine the location of the particular object relative to the vehicle. For example, the system 200 may be configured to translate from data defining a bounding box around the particular object in a set of spatial data maps to 3D localization data characterizing the position of the particular object in 3D space relative to the vehicle. Even when the spatial data maps 202 include sensor data (e.g., from laser or radar sensors) characterizing the distance from the vehicle to points in the vicinity of the vehicle, the object bounding box 208 may not directly define the object localization data 204. For example, a bounding box around the particular object in a spatial data map representing laser or radar sensor data may include data defining distances from the vehicle to points behind the particular object and in front of the particular object, in addition to points on the particular object.

The system 200 processes the spatial data maps 202 using a convolutional neural network 210, in accordance with trained values of convolutional neural network parameters, to generate a sensor feature representation 212. The sensor feature representation 212 can be represented in any appropriate numerical format, for example, as a multi-dimensional matrix (i.e., tensor) of numerical values. In some cases, the sensor feature representation 212 can be represented as a three-dimensional matrix, where two of the dimensions of the matrix are “spatial dimensions”. A spatial location in the sensor feature representation 212 refers to a position in the sensor feature representation indexed by values of the spatial dimension coordinates. Each spatial location in the sensor feature representation characterizes a respective spatial region of the spatial data maps. The sensor feature representation 212 can be understood as an alternative representation of the spatial data maps 202 which represents complex data interactions within and between spatial data maps in a form which can be effectively processed to determine the locations of objects in the vicinity of the vehicle. Optionally, the convolutional neural network 210 may additionally process a roadmap spatial data map, an elevation spatial data map, and sensor data generated by additional vehicle sensors (e.g., audio and heat sensors), as described earlier.
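For illustration only, a convolutional backbone of this kind could be sketched as follows in TensorFlow/Keras; the layer counts, filter sizes, and input dimensions are placeholders and are not the architecture of the convolutional neural network 210.

```python
import tensorflow as tf

def make_sensor_feature_network(height=300, width=300, channels=8):
    """Minimal sketch of a backbone mapping a stack of aligned spatial data
    maps (height x width x channels) to a sensor feature representation with
    two spatial dimensions and a feature dimension.
    """
    inputs = tf.keras.Input(shape=(height, width, channels))
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPool2D(2)(x)  # reduces the spatial dimensionality
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    features = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    return tf.keras.Model(inputs, features, name="sensor_feature_network")
```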

The system 200 includes a cropping engine 214 which is configured to generate an object feature representation 216 of the particular object from a portion of the sensor feature representation 212 corresponding to the particular spatial region in the spatial data maps 202 defined by the object bounding box 208. For example, as will be described in more detail with reference to FIG. 4, the cropping engine 214 can generate the object feature representation 216 by cropping the portion of the sensor feature representation 212 corresponding to a bounding box around the particular object in the spatial data maps 202 which is defined by the object bounding box 208. In some implementations, the cropping engine 214 may additionally be configured to generate the object feature representation 216 of the particular object from a portion of the one or more spatial data maps 202 corresponding to the particular spatial region defined by the object bounding box 208.

The system 200 processes the object feature representation 216 using a localization neural network 218, in accordance with current values of localization neural network parameters, to generate the object localization data 204 characterizing the location of the particular object relative to the vehicle. The localization neural network 218 can be implemented using any appropriate neural network architecture. For example, the localization neural network 218 may include one or more convolutional neural network layers followed by one or more fully-connected neural network layers.

In some implementations, the localization neural network 218 is configured to generate numerical output data which directly characterizes the location of the particular object relative to the vehicle. For example, the output layer of the localization neural network 218 may be defined by a single neuron that generates an activation characterizing a distance from the vehicle to the particular object. As another example, the output layer of the localization neural network 218 may be defined by a set of multiple neurons that each correspond to a respective dimension of a coordinate system, and the activations of these neurons may characterize a location of the particular object in the coordinate system.
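A minimal sketch of such a localization network, assuming a Keras implementation with illustrative layer sizes (the specification does not prescribe a particular architecture):

```python
import tensorflow as tf

def make_localization_head(num_outputs=3):
    """Minimal localization network sketch: a few convolutional layers over
    the object feature representation followed by fully-connected layers.

    num_outputs=1 would correspond to a single distance value; num_outputs=3
    to (x, y, z) coordinates of the object center.
    """
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_outputs),  # linear outputs: coordinates or distance
    ], name="localization_head")
```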

In some implementations, the localization neural network 218 is configured to generate data defining a probability distribution over a predetermined set of possible locations of the object relative to the vehicle. For example, the output layer of the localization neural network 218 may be a soft-max layer including multiple neurons which each correspond to a respective location in a predetermined lattice of possible locations of the particular object relative to the vehicle. In this example, the activation of each neuron in the output layer may define a respective probability that the particular object is situated at the location corresponding to the neuron. In some of these implementations, the system 200 may generate the object localization data 204 by sampling a possible location of the particular object relative to the vehicle in accordance with the probability distribution. Alternatively, the system 200 may generate the object localization data 204 by selecting a possible location of the particular object relative to the vehicle that has a highest probability value according to the probability distribution generated by the localization neural network 218.
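The two decoding strategies (sampling versus selecting the highest-probability candidate) can be illustrated with a small helper; the function name and the lattice representation below are hypothetical.

```python
import numpy as np

def decode_location(probs, lattice, sample=False, rng=None):
    """Turn a probability distribution over a predetermined lattice of
    candidate locations into a single location estimate.

    probs   : array of shape (num_candidates,) summing to 1 (e.g., soft-max output)
    lattice : array of shape (num_candidates, 3) of candidate (x, y, z) locations
    """
    if sample:
        rng = rng or np.random.default_rng()
        index = rng.choice(len(lattice), p=probs)  # sample according to the distribution
    else:
        index = int(np.argmax(probs))              # most probable candidate
    return lattice[index]
```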

As described above, the object localization data 204 generated by the system 200 can be processed by other components of an on-board system of a vehicle to, for example, make fully-autonomous or partly-autonomous driving decisions.

The trained parameter values of the convolutional neural network 210, the localization neural network 218, or both can be provided to the object localization system 200 by a training system, as will be described further with reference to FIG. 3.

FIG. 3 is a block diagram of an example training system 300. The training system 300 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The training system 300 is configured to generate a set of trained parameter values 302 for the parameters of the convolutional neural network, the localization neural network, or both of the object localization system in the on-board system 100 of the vehicle 102. As will be described in more detail below, the training system 300 generates trained parameter values 302 which are optimized to enable the on-board object localization system to generate accurate object localization data. The training system 300 is typically hosted within a data center 304, which can be a distributed computing system having hundreds or thousands of computers in one or more locations. The training system 300 can provide the trained parameter values 302 to the on-board system 100 by a wired or wireless connection.

The training system 300 includes a training object localization system 306 which includes computing devices having software or hardware modules that implement the respective operations of the on-board object localization system (e.g., described with reference to FIG. 2). In particular, the training object localization system 306 implements the operations of a training convolutional neural network and a training localization neural network. The training neural networks generally have the same (or similar) architecture as the on-board neural networks.

The training system 300 trains the training object localization system 306 based on a set of training data 308. The training data 308 includes multiple training examples, where each training example can include: (i) one or more training spatial data maps, (ii) a training object bounding box defining a particular region in the training spatial data maps that has been classified as including sensor data that characterizes a particular training object in the environment in the vicinity of the vehicle, and (iii) target object localization data that characterizes the position of the particular training object in the environment relative to the vehicle. The target object localization data represents the object localization data which should be generated by the training object localization system 306 by processing the training spatial data maps and the training object bounding box included in the training example.

To generate the trained parameter values 302, a training engine 310 iteratively adjusts the parameter values 312 of the training object localization system 306 to cause the training object localization system 306 to generate object localization data outputs 314 which match the target object localization data specified by the training examples. The parameter values 312 of the training object localization system 306 include the parameter values of the training convolutional neural network, the training localization neural network, or both. The training engine 310 iteratively adjusts the parameter values 312 of the training object localization system 306 by using machine learning training techniques (e.g., stochastic gradient descent) to generate iterative parameter value adjustments 316 which are used to adjust the parameter values 312. After determining that a training termination criterion is met, the training system 300 can output trained parameter values 302 corresponding to the current parameter values 312 at the final training iteration.

In some implementations, the training examples included in the training data 308 include additional target auxiliary outputs. For example, as will be described further with reference to FIG. 5, the target auxiliary outputs may characterize a geometry of the environment in the vicinity of the vehicle, attributes of the particular training object, or future attributes of the particular training object. In these implementations, the training object localization system 306 may be configured to generate auxiliary outputs corresponding to some or all of the target auxiliary outputs specified by the training examples. The training engine 310 can additionally adjust the parameter values 312 to cause the training object localization system 306 to generate auxiliary outputs 318 which match the target auxiliary outputs specified by the training examples. By co-training the training object localization system 306 to generate auxiliary outputs 318 in addition to the object localization data 314, the training system 300 can determine the trained parameter values 302 using fewer training iterations.

The training spatial data maps corresponding to the training examples of the training data 308 can be obtained from logged sensor data acquired by vehicle sensors, from simulated sensor data acquired from simulated vehicle sensors in a simulated environment, or both. The training object bounding box, target object localization data, and target auxiliary outputs corresponding to the training examples of the training data 308 can be generated by automated annotation procedures, manual annotation procedures, or a combination of both.

To reduce overfitting during training, the training convolutional neural network and the training localization neural network may include one or more dropout layers. The dropout layers may be configured to randomly drop out (e.g., by setting to a constant value) one or more of the spatial data maps (e.g., the spatial data map generated from the raw sensor data of a laser sensor). In this manner, the object localization system can be trained to generate accurate object localization data even if some sensors are unavailable (e.g., due to being damaged) and to generalize more effectively to new environments.
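A minimal sketch of this map-level dropout, assuming the spatial data maps are stacked along a channel dimension; the function name, drop rate, and fill value are illustrative assumptions rather than the specification's implementation.

```python
import numpy as np

def drop_spatial_data_maps(maps, drop_rate=0.1, fill_value=0.0, rng=None):
    """Training-time sketch: with probability drop_rate, an entire spatial
    data map (channel) is replaced by a constant value, simulating an
    unavailable sensor.

    maps: array of shape (height, width, num_maps).
    """
    rng = rng or np.random.default_rng()
    keep = rng.random(maps.shape[-1]) >= drop_rate  # one keep/drop decision per map
    out = maps.copy()
    out[..., ~keep] = fill_value
    return out
```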

FIG. 4 is a flow diagram of an example process 400 for generating object localization data characterizing the location of a particular object relative to a vehicle. For convenience, the process 400 will be described as being performed by a system of one or more computers located in one or more locations. For example, an object localization system, e.g., the object localization system 200 of FIG. 2, appropriately programmed in accordance with this specification, can perform the process 400.

The system obtains sensor data captured by sensors of the vehicle (402). The sensor data can represent data captured by laser sensors, radar sensors, camera sensors, or other vehicle sensors. Some or all of the sensor data can be represented by respective spatial data maps, that is, by respective matrices (e.g., two- or three-dimensional matrices) of numerical values (as described with reference to FIG. 2).

The system generates a sensor feature representation by processing the sensor data (in particular, the spatial data maps representing the sensor data) using a convolutional neural network in accordance with current values of convolutional neural network parameters (404). To process multiple spatial data maps using the convolutional neural network, the system first concatenates the spatial data maps with the same perspective of the environment to align their respective spatial locations. Spatial data maps with different perspectives of the environment may be processed by different convolutional neural networks, and the outputs of these different convolutional neural networks may jointly form the sensor feature representation. The sensor feature representation can be represented in any appropriate numerical format, for example, as a multi-dimensional matrix of numerical values.
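As an illustrative sketch of the concatenation step, assuming the aligned maps are NumPy arrays stacked along a channel dimension (the helper name is hypothetical):

```python
import numpy as np

def stack_aligned_maps(maps):
    """Concatenate spatial data maps that share the same perspective along a
    channel dimension, so that corresponding spatial locations stay aligned.

    maps: list of arrays, each of shape (height, width) or (height, width, c).
    Returns an array of shape (height, width, total_channels).
    """
    expanded = [m[..., np.newaxis] if m.ndim == 2 else m for m in maps]
    spatial_shapes = {m.shape[:2] for m in expanded}
    assert len(spatial_shapes) == 1, "maps with the same perspective must share spatial dims"
    return np.concatenate(expanded, axis=-1)
```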

The system obtains data which defines the position of the particular object in the sensor data (406). More specifically, the system obtains data which defines a particular spatial region in the sensor data (in particular, the spatial data maps) that has been classified as including sensor data that characterizes the particular object. For example, data defining the particular spatial region in the sensor data may be defined by the coordinates of a bounding box delineating a particular spatial region in the spatial data maps that represents the particular object. In some implementations, the data defining the particular spatial region in the sensor data (e.g., the object bounding box) is an output generated by an object detection neural network by processing an input including the spatial data maps.

The system generates an object feature representation of the particular object from a portion of the sensor feature representation corresponding to the particular spatial region in the sensor data that has been classified as including sensor data that characterizes the particular object (408). For example, the system can generate the object feature representation by cropping the portion of the sensor feature representation corresponding to the particular spatial region (e.g., as defined by a bounding box) and using pooling operations to transform the cropped portion of the sensor feature representation to a predetermined size. In this example, the pooling operations may include dividing the cropped portion of the sensor feature representation into a grid of a predetermined number of sub-windows, and then max-pooling the values in each sub-window into a corresponding position in the object feature representation. By using the described pooling operations, the system can generate an object feature representation with a predetermined size suitable for processing by the localization neural network (to be described in 410) regardless of the size of the particular spatial region defined by the object bounding box.
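The crop-and-max-pool operation can be sketched as follows; this is an illustrative NumPy implementation under the assumption of a non-empty, axis-aligned box given in feature-map coordinates, not the cropping engine itself.

```python
import numpy as np

def crop_and_pool(features, box, grid=(7, 7)):
    """Crop the portion of a sensor feature representation inside a bounding
    box and max-pool it into a fixed grid, producing an object feature
    representation of predetermined size regardless of the box size.

    features: array of shape (H, W, C) (the sensor feature representation).
    box: (top, left, bottom, right) in feature-map coordinates (non-empty).
    """
    top, left, bottom, right = box
    crop = features[top:bottom, left:right, :]
    out = np.zeros(grid + (features.shape[-1],), dtype=features.dtype)
    rows = np.linspace(0, crop.shape[0], grid[0] + 1).astype(int)
    cols = np.linspace(0, crop.shape[1], grid[1] + 1).astype(int)
    for i in range(grid[0]):
        for j in range(grid[1]):
            # Each sub-window contributes at least one row and column.
            window = crop[rows[i]:max(rows[i + 1], rows[i] + 1),
                          cols[j]:max(cols[j + 1], cols[j] + 1), :]
            out[i, j] = window.max(axis=(0, 1))  # max-pool per feature channel
    return out
```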

When the sensor feature representation includes the outputs of multiple different convolutional neural networks which each process spatial data maps with a respective perspective of the environment, the system may generate the object feature representation from respective portions of the output of each convolutional neural network. More specifically, the system can generate the object feature representation by cropping and pooling the respective portion of the output of each convolutional neural network corresponding to the particular spatial region in the sensor data that has been classified as including sensor data that characterizes the particular object.

In some cases, the spatial dimensionality of the sensor feature representation may be lower than the spatial dimensionality of the spatial data maps, for example, due to convolution and pooling operations performed in the convolutional neural network used to generate the sensor feature representation. In a particular example, the spatial dimensionality of the spatial data maps may be 300×300, while the spatial dimensionality of the sensor feature representation may be 250×250. In these cases, the portion of the sensor feature representation corresponding to the particular spatial region in the spatial data maps (e.g., defined by an object bounding box) may be adjusted to account for the reduction in spatial dimensionality of the sensor feature representation relative to the spatial data maps. For example, the coordinates of a bounding box in the spatial data maps defined by the object bounding box may be translated, rescaled, or both to account for the reduction in spatial dimensionality of the sensor feature representation relative to the spatial data maps.
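A minimal sketch of such a rescaling, using the 300×300 to 250×250 example above; the rounding convention and function name are assumptions for illustration.

```python
def rescale_box_to_feature_map(box, map_size, feature_size):
    """Rescale a bounding box given in spatial-data-map coordinates to the
    (smaller) spatial dimensionality of the sensor feature representation.

    box: (top, left, bottom, right) in spatial-data-map coordinates.
    map_size: (height, width) of the spatial data maps, e.g. (300, 300).
    feature_size: (height, width) of the feature representation, e.g. (250, 250).
    """
    scale_y = feature_size[0] / map_size[0]
    scale_x = feature_size[1] / map_size[1]
    top, left, bottom, right = box
    return (int(round(top * scale_y)), int(round(left * scale_x)),
            int(round(bottom * scale_y)), int(round(right * scale_x)))

# Example: a 60x60 box in 300x300 map coordinates maps to a 50x50 box
# in 250x250 feature-map coordinates.
print(rescale_box_to_feature_map((60, 90, 120, 150), (300, 300), (250, 250)))
```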

Optionally, the system may additionally generate the object feature representation of the particular object based on the portion of the sensor data that has been classified as including sensor data that characterizes the particular object. For example, the system may crop the portion of the spatial data maps corresponding to the particular spatial region, transform the cropped portion of the spatial data maps to a predetermined size (e.g., using pooling operations, as described above), and concatenate the resulting data to the object feature representation.

The system generates the object localization data characterizing the location of the particular object relative to the vehicle by processing the object feature representation using a localization neural network in accordance with current values of localization neural network parameters (410).

In some implementations, the localization neural network is configured to generate numerical output data which directly characterizes the location of the object relative to the vehicle. For example, the localization neural network may be configured to generate numerical data defining numerical coordinates of the center of the object in a coordinate system (e.g., a Cartesian coordinate system) defined relative to the vehicle.

In some implementations, the localization neural network is configured to generate data defining a probability distribution over a predetermined set of possible locations of the object relative to the vehicle. For example, the localization neural network may be configured to generate numerical data which includes a respective probability value for each location in a predetermined lattice of possible locations of the object relative to the vehicle. In some of these implementations, the system may generate the object localization data characterizing the location of the particular object relative to the vehicle by sampling a possible location of the particular object relative to the vehicle in accordance with the probability distribution. Alternatively, the system may generate the object localization data characterizing the location of the particular object relative to the vehicle by selecting a possible location of the particular object relative to the vehicle that has a highest probability value according to the probability distribution generated by the localization neural network.

FIG. 5 is a flow diagram of an example process 500 for updating the parameter values of a training object localization system. In particular, FIG. 5 describes one iteration of an example iterative process for updating the parameter values of a training object localization system. For convenience, the process 500 will be described as being performed by a training system including one or more computers located in one or more locations. For example, the training system 300 of FIG. 3, appropriately programmed in accordance with this specification, can perform the process 500.

The training system obtains one or more training examples (502). Each training example includes: (i) one or more training spatial data maps, (ii) a training object bounding box defining a particular region in the training spatial data maps that has been classified as including sensor data that characterizes a particular training object in the vicinity of the vehicle, and (iii) target object localization data that characterizes the position of the particular training object relative to the vehicle. Optionally, as will be described in more detail below, each training example can include additional target auxiliary outputs in addition to the target object localization data. More generally, each training example can include any appropriate data defining the particular region in the training spatial data maps that has been classified as including sensor data that characterizes a particular training object in the vicinity of the vehicle, and this data is not limited to being represented by a bounding box.
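For illustration, a training example of this kind might be held in a structure like the following; the field names and shapes are hypothetical, not the format of the training data 308.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

import numpy as np

@dataclass
class TrainingExample:
    """Illustrative container for one training example, assuming an
    axis-aligned rectangular bounding box; all field names are hypothetical."""
    spatial_data_maps: np.ndarray                  # (height, width, num_maps)
    object_box: Tuple[int, int, int, int]          # (top, left, bottom, right) in map coordinates
    target_location: np.ndarray                    # e.g., (x, y, z) relative to the vehicle
    target_auxiliary: Dict[str, np.ndarray] = field(default_factory=dict)
    # e.g., {"depth_map": ..., "object_type": ..., "future_location": ...}
```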

In some implementations, each training example includes target auxiliary outputs characterizing a geometry of the environment in the vicinity of the vehicle. For example, the target auxiliary outputs characterizing the geometry of the environment may include a target per-pixel depth map, that is, data defining a depth value corresponding to each pixel of the training spatial data maps. As another example, the target auxiliary outputs characterizing the geometry of the environment may include data defining spatial regions in the training spatial data maps that have been classified as including sensor data that characterizes objects in the environment in the vicinity of the vehicle. In this example, the target auxiliary outputs characterizing the geometry of the environment may include data defining bounding boxes in the training spatial data maps which enclose sensor data characterizing objects in the environment in the vicinity of the vehicle. As another example, the target auxiliary outputs characterizing the geometry of the environment may include data defining a spatial region in the training spatial data maps which has been classified as including sensor data that characterizes an area of ground beneath the particular training object. In this example, the target auxiliary outputs characterizing the geometry of the environment may include data defining a bounding box in the training spatial data maps which encloses sensor data characterizing an area of ground beneath the particular training object.

In some implementations, each training example includes target auxiliary outputs characterizing attributes of the particular training object. For example, the target auxiliary outputs characterizing attributes of the particular training object may include data defining the size of the particular training object (e.g., measured in square feet). As another example, the target auxiliary outputs characterizing attributes of the particular training object may include data defining the type of the particular training object (e.g., “vehicle”, “pedestrian”, “obstacle”, and the like).

In some implementations, each training example includes target auxiliary outputs characterizing future attributes of the particular training object. For example, the target auxiliary outputs characterizing the future attributes of the particular training object may include data defining a future location of the center of the particular training object in a particular coordinate system after a predetermined amount of time has elapsed (e.g., 1 second). As another example, the target auxiliary outputs characterizing the future attributes of the particular training object may include data defining a future appearance (e.g., visual appearance) of the particular training object after a predetermined amount of time has elapsed.

The training system may obtain one or more training examples by randomly sampling training examples from a set of training data which includes multiple training examples. For convenience, the process 500 will be described with reference to a particular training example.

The training system processes the training spatial data maps and the training object bounding box included in the particular training example using the training object localization system to generate object localization data characterizing the location of the particular training object relative to the vehicle (504). Optionally, the training object localization system (in particular, the training convolutional neural network, the training localization neural network, or both) may be configured to generate additional auxiliary outputs corresponding to some or all of the target auxiliary outputs included in the training example. That is, the training convolutional neural network, the training localization neural network, or both may include one or more additional layers (i.e., layers additional to those which are used to generate the object localization data) which are configured to generate the additional auxiliary outputs. To generate the additional auxiliary outputs, the one or more additional layers process outputs from one or more intermediate (i.e., hidden) layers of the training convolutional neural network or training localization neural network.

The training object localization system (e.g., the convolutional neural network, the localization neural network, or both) may be configured to generate auxiliary outputs characterizing the geometry of the environment in the vicinity of the vehicle, attributes of the particular training object, or future attributes of the particular training object. When the training object localization system is configured to generate auxiliary outputs characterizing future attributes of the particular training object, the training object localization system (and the object localization system in the on-board system of the vehicle) may be configured to process spatial data maps from multiple time points.

The training system adjusts the current parameter values of the training object localization system (i.e., the current parameter values of the training convolutional neural network, the training localization neural network, or both) based on the object localization output data and (optionally) the auxiliary outputs (506). For example, the training system may determine an update to the current parameter values of the training object localization system by computing a gradient of a loss function with respect to the current parameter values of the training object localization system. The loss function may include a respective term for the object localization output and for each auxiliary output generated by the training object localization system. For example, the loss function may include a mean-squared error term measuring a difference between the object localization output generated by the training object localization system and the target object localization output specified by the particular training example. As another example, the loss function may include a cross-entropy loss term between an auxiliary output generated by the training object localization system which characterizes the type of the particular training object and a corresponding target auxiliary output specified by the training example.
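A loss of this form might be sketched as follows in TensorFlow; the auxiliary-term weighting and the use of a sparse cross-entropy over object-type classes are assumptions for illustration, not the loss used by the training engine 310.

```python
import tensorflow as tf

def total_loss(pred_location, target_location,
               pred_type_logits=None, target_type=None, aux_weight=0.5):
    """Illustrative training loss: a mean-squared-error term for the object
    localization output plus (optionally) a cross-entropy term for an
    auxiliary object-type output. The weighting is a placeholder.
    """
    loss = tf.reduce_mean(tf.square(pred_location - target_location))
    if pred_type_logits is not None:
        cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=target_type, logits=pred_type_logits)
        loss += aux_weight * tf.reduce_mean(cross_entropy)
    return loss
```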

After computing the gradient of the loss function with respect to the current parameter values of the training object localization system, the training system adjusts the current values of the object localization system parameters using any appropriate gradient descent optimization algorithm update rule. Examples of gradient descent optimization algorithms include Adam, RMSprop, Adagrad, Adadelta, and AdaMax, amongst others. When the training object localization system is configured to generate auxiliary outputs, the training system can adjust the current values of some or all of the training object localization system parameters using the gradients of the loss function terms corresponding to the auxiliary outputs.
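Continuing the illustrative loss above, a single parameter update using the Adam optimizer might look like the following sketch; the model interface and the example fields are hypothetical.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

def train_step(model, example):
    """One gradient-descent update on a hypothetical model that maps the
    training inputs to a predicted object location."""
    with tf.GradientTape() as tape:
        pred_location = model(example["inputs"], training=True)
        loss = total_loss(pred_location, example["target_location"])
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```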

This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

In this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.

Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.

Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.
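By way of illustration only, a model of the general kind described in this specification (a convolutional encoding network followed by a localization network that operates on a cropped, fixed-size portion of the feature map) could be sketched in one such framework roughly as follows. The sketch uses TensorFlow/Keras; the function names (build_encoder, build_localization_head, localize), the 200x200x5 sensor grid, the layer sizes, and the use of crop-and-resize as a stand-in for the crop-and-pool step are all assumptions made for this example, not details taken from the claimed system.

    import tensorflow as tf

    def build_encoder():
        # Convolutional encoding network: sensor grid -> feature map.
        return tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                                   input_shape=(200, 200, 5)),
            tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        ])

    def build_localization_head():
        # Localization network: fixed-size object features -> (x, y) of the
        # object's center in a vehicle-relative coordinate system.
        return tf.keras.Sequential([
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(2),
        ])

    def localize(encoder, head, sensor_data, boxes):
        # sensor_data: [batch, 200, 200, 5]; boxes: one normalized
        # [y1, x1, y2, x2] box per example, defining the spatial region
        # classified as containing the object of interest.
        features = encoder(sensor_data)
        # Crop the feature map to that region and resample it to a fixed
        # 7x7 size before the localization head.
        crops = tf.image.crop_and_resize(
            features,
            boxes=boxes,
            box_indices=tf.range(tf.shape(sensor_data)[0]),
            crop_size=(7, 7))
        return head(crops)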

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

What is claimed is:
 1. A method performed by one or more data processing apparatus, the method comprising: obtaining sensor data captured by one or more sensors of a vehicle; generating a sensor feature representation of the sensor data by processing an input comprising the sensor data using an encoding neural network; obtaining data defining a spatial region in the sensor data that has been classified as including sensor data that characterizes an object in an environment in a vicinity of the vehicle; generating an object feature representation of the object from a portion of the sensor feature representation corresponding to the spatial region; and processing an input comprising the object feature representation of the object using a localization neural network to generate a localization output characterizing a location of the object in the environment relative to the vehicle; wherein the encoding neural network and the localization neural network have been jointly trained to optimize a loss function by training operations comprising: using the encoding neural network and the localization neural network to generate a predicted localization output; determining gradients of the loss function with respect to current parameter values of the encoding neural network and the localization neural network, wherein the loss function measures an error in the predicted localization output; and updating the current parameter values of the encoding neural network and the localization neural network using the gradients of the loss function.

 2. The method of claim 1, wherein the predicted localization output defines a location of a training object in a training environment relative to a training vehicle; generating the predicted localization output comprises processing training sensor data captured by one or more sensors of the training vehicle using the encoding neural network; and the training operations further comprise training the encoding neural network to generate auxiliary outputs characterizing a geometry of the training environment.
 3. The method of claim 2, wherein the auxiliary outputs characterizing the geometry of the training environment comprise one or more of: (i) an auxiliary output defining a predicted depth of the training environment, (ii) an auxiliary output that defines spatial regions in the training sensor data that include sensor data characterizing objects in the training environment, or (iii) an auxiliary output defining a spatial region in the sensor data that defines a location of an area of ground beneath the training object.

 4. The method of claim 2, wherein the training operations further comprise training the localization neural network to generate auxiliary outputs characterizing the training object.
 5. The method of claim 4, wherein the auxiliary outputs characterizing the training object comprise one or more of: (i) an auxiliary output defining a size of the training object, (ii) an auxiliary output defining a type of the training object, (iii) an auxiliary output that defines a future location of a center of the training object relative to the training vehicle after a predefined length of time has elapsed, or (iv) an auxiliary output that defines a future appearance of the training object after a predefined length of time has elapsed.
 6. The method of claim 2, further comprising masking a portion of the training sensor data prior to processing the training sensor data using the encoding neural network.
 7. The method of claim 6, wherein the training sensor data comprises data captured by a plurality of sensors, and masking a portion of the training sensor data comprises: masking a portion of the training sensor data captured by one or more specified sensors of the plurality of sensors.
 8. The method of claim 6, wherein masking a portion of the training sensor data comprises setting the portion of the training sensor data equal to a constant value.

 9. The method of claim 1, wherein obtaining sensor data captured by one or more sensors of the vehicle comprises: aligning and combining sensor data captured by one or more of: a laser sensor of the vehicle, a radar sensor of the vehicle, and a camera sensor of the vehicle.
 10. The method of claim 1, wherein obtaining data defining a particular spatial region in the sensor data that has been classified as including sensor data that characterizes a particular object in the environment in the vicinity of the vehicle comprises: obtaining an object bounding box with a rectangular geometry.
 11. The method of claim 1, wherein obtaining data defining a particular spatial region in the sensor data that has been classified as including sensor data that characterizes a particular object in an environment in a vicinity of the vehicle comprises: obtaining data generated by processing at least part of the sensor data using an object detection neural network to generate data defining the particular spatial region.
 12. The method of claim 1, further comprising: generating the object feature representation of the object from a portion of the sensor data corresponding to the spatial region in addition to the portion of the sensor feature representation corresponding to the spatial region.
 13. The method of claim 1, wherein generating an object feature representation of the object from a portion of the sensor feature representation corresponding to the spatial region comprises: cropping the portion of the sensor feature representation corresponding to the spatial region; and transforming the cropped portion of the sensor feature representation to a fixed size using one or more pooling operations.
 14. The method of claim 1, wherein processing an input comprising the object feature representation of the object using a localization neural network to generate an output characterizing a location of the object in the environment relative to the vehicle comprises: processing the input comprising the object feature representation using the localization neural network to generate an output comprising coordinates characterizing a position of a center of the object in the environment, wherein the coordinates are expressed in a coordinate system which is defined relative to the vehicle.
 15. The method of claim 1, wherein processing an input comprising the object feature representation of the object using a localization neural network to generate an output characterizing a location of the object in the environment relative to the vehicle comprises: processing the input comprising the object feature representation using the localization neural network to generate an output comprising a distance value characterizing a distance of the object in the environment from the vehicle.
 16. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: obtaining sensor data captured by one or more sensors of a vehicle; generating a sensor feature representation of the sensor data by processing an input comprising the sensor data using an encoding neural network; obtaining data defining a spatial region in the sensor data that has been classified as including sensor data that characterizes an object in an environment in a vicinity of the vehicle; generating an object feature representation of the object from a portion of the sensor feature representation corresponding to the spatial region; and processing an input comprising the object feature representation of the object using a localization neural network to generate a localization output characterizing a location of the object in the environment relative to the vehicle; wherein the encoding neural network and the localization neural network have been jointly trained to optimize a loss function by training operations comprising: using the encoding neural network and the localization neural network to generate a predicted localization output; determining gradients of the loss function with respect to current parameter values of the encoding neural network and the localization neural network, wherein the loss function measures an error in the predicted localization output; and updating the current parameter values of the encoding neural network and the localization neural network using the gradients of the loss function.

 17. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: obtaining sensor data captured by one or more sensors of a vehicle; generating a sensor feature representation of the sensor data by processing an input comprising the sensor data using an encoding neural network; obtaining data defining a spatial region in the sensor data that has been classified as including sensor data that characterizes an object in an environment in a vicinity of the vehicle; generating an object feature representation of the object from a portion of the sensor feature representation corresponding to the spatial region; and processing an input comprising the object feature representation of the object using a localization neural network to generate a localization output characterizing a location of the object in the environment relative to the vehicle; wherein the encoding neural network and the localization neural network have been jointly trained to optimize a loss function by training operations comprising: using the encoding neural network and the localization neural network to generate a predicted localization output; determining gradients of the loss function with respect to current parameter values of the encoding neural network and the localization neural network, wherein the loss function measures an error in the predicted localization output; and updating the current parameter values of the encoding neural network and the localization neural network using the gradients of the loss function.
 18. The non-transitory computer storage media of claim 17, wherein the predicted localization output defines a location of a training object in a training environment relative to a training vehicle; generating the predicted localization output comprises processing training sensor data captured by one or more sensors of the training vehicle using the encoding neural network; and the training operations further comprise training the encoding neural network to generate auxiliary outputs characterizing a geometry of the training environment.
 19. The non-transitory computer storage media of claim 18, wherein the auxiliary outputs characterizing the geometry of the training environment comprise one or more of: (i) an auxiliary output defining a predicted depth of the training environment, (ii) an auxiliary output that defines spatial regions in the training sensor data that include sensor data characterizing objects in the training environment, or (iii) an auxiliary output defining a spatial region in the sensor data that defines a location of an area of ground beneath the training object.
 20. The non-transitory computer storage media of claim 18, wherein the training operations further comprise training the localization neural network to generate auxiliary outputs characterizing the training object.
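The joint-training operations recited in claims 1, 16, and 17 (generating a predicted localization output with both networks, measuring its error with a loss function, and updating both networks' parameters with the gradients of that loss) follow the usual pattern of gradient-based training. A minimal sketch of one such training step is given below; it reuses the hypothetical build_encoder, build_localization_head, and localize functions from the earlier TensorFlow example, and the mean-squared-error loss and Adam optimizer are arbitrary choices made for illustration, not details drawn from the claims.

    import tensorflow as tf

    # Hypothetical networks from the earlier sketch.
    encoder = build_encoder()
    head = build_localization_head()
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

    def train_step(sensor_data, boxes, target_locations):
        with tf.GradientTape() as tape:
            # Use the encoding and localization networks to generate a
            # predicted localization output.
            predicted = localize(encoder, head, sensor_data, boxes)
            # The loss measures the error in the predicted localization
            # output (here, squared error against the known location).
            loss = tf.reduce_mean(tf.square(predicted - target_locations))
        # Gradients of the loss with respect to the current parameter
        # values of both networks are used to update those parameters.
        variables = encoder.trainable_variables + head.trainable_variables
        gradients = tape.gradient(loss, variables)
        optimizer.apply_gradients(zip(gradients, variables))
        return loss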